Version: 25.10 (Latest)

Rule Building and Tuning Guide

This guide gives practical advice for building, testing, and tuning Content Identification rules so that they detect sensitive content accurately with few false positives. Use it alongside the reference documentation.

Getting Started

Understanding Your Requirements

Before building rules, clearly define your detection objectives:

  1. What data to detect: Specific data types (SSN, credit cards, etc.)
  2. Where it appears: File types, applications, communication channels
  3. Accuracy requirements: Acceptable false positive/negative rates
  4. Performance constraints: Processing time and resource limitations

Planning Your Approach

  1. Start with existing rules: Review predefined rules for similar use cases
  2. Gather sample data: Collect representative content for testing
  3. Define success criteria: Set measurable goals for accuracy and performance
  4. Plan iterative development: Build, test, refine in cycles

Building Your First Rule

Step 1: Create a Basic Rule Pack

Start with a simple rule pack structure:

<?xml version="1.0" encoding="UTF-8"?>
<RulePackage xmlns="http://schemas.microsoft.com/office/2011/mce">
  <RulePack id="my-custom-rules">
    <Version major="1" minor="0" build="0" revision="0"/>
    <Publisher id="my-organization"/>
    <Details defaultLangCode="en">
      <LocalizedDetails langcode="en">
        <PublisherName>My Organization</PublisherName>
        <Name>Custom Detection Rules</Name>
        <Description>Custom rules for detecting sensitive data</Description>
      </LocalizedDetails>
    </Details>

    <Rules>
      <!-- Rules will go here -->
    </Rules>

    <Resources>
      <!-- Shared resources will go here -->
    </Resources>
  </RulePack>
</RulePackage>

Step 2: Define Shared Resources

Create reusable resources for keywords and patterns:

<Resources>
  <!-- Keywords for financial terms -->
  <Keyword id="financial-keywords">
    <Group matchStyle="word">
      <Term>account</Term>
      <Term>balance</Term>
      <Term>payment</Term>
      <Term>transaction</Term>
    </Group>
  </Keyword>

  <!-- Pattern for account numbers -->
  <Regex id="account-number-pattern">
    <Pattern>\b\d{8,12}\b</Pattern>
  </Regex>
</Resources>

Step 3: Create a Simple Detection Rule

Start with a basic entity rule:

<Rules>
  <Entity id="bank-account-detection" patternsProximity="300" recommendedConfidence="75">
    <Pattern confidenceLevel="85">
      <IdMatch idRef="account-number-pattern"/>
      <Match idRef="financial-keywords"/>
    </Pattern>
  </Entity>
</Rules>

Testing and Validation

Creating Test Content

Develop comprehensive test content that includes:

  1. Positive samples: Content that should match your rules
  2. Negative samples: Similar content that should not match
  3. Edge cases: Boundary conditions and unusual formats
  4. Real-world samples: Actual content from your environment

Test Content Examples

Positive Test Cases:

Account number: 123456789
Payment to account 987654321
Transaction for account #555666777

Negative Test Cases:

Phone number: 123456789
Order number: 987654321
Reference ID: 555666777

Edge Cases:

Account: 12345678 (minimum length)
Account: 123456789012 (maximum length)
Acct 123-456-789 (with formatting)
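
The samples above can be checked before deployment with a small harness. The following is a minimal Python sketch, not the product's evaluation engine: it assumes that patternsProximity is a character window around the ID match, that matchStyle="word" means case-insensitive whole-word matching, and that Python's re module behaves like the product's regex flavor for this pattern. The function names rule_matches and run_samples are hypothetical.

```python
import re

# Regex and keywords copied from the Step 2 resources.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,12}\b")
FINANCIAL_KEYWORDS = ("account", "balance", "payment", "transaction")
PROXIMITY = 300  # assumption: a character window around the ID match


def rule_matches(text: str) -> bool:
    """Approximate the rule: an account-number match plus a nearby keyword."""
    for id_match in ACCOUNT_NUMBER.finditer(text):
        start = max(0, id_match.start() - PROXIMITY)
        window = text[start:id_match.end() + PROXIMITY].lower()
        for keyword in FINANCIAL_KEYWORDS:
            # \b approximates whole-word keyword matching (assumption).
            if re.search(rf"\b{re.escape(keyword)}\b", window):
                return True
    return False


# (sample text, expected result) pairs taken from the test content.
# The formatted "Acct" sample is expected to match on the assumption
# that formatted account numbers should be detected.
SAMPLES = [
    ("Account number: 123456789", True),
    ("Payment to account 987654321", True),
    ("Transaction for account #555666777", True),
    ("Phone number: 123456789", False),
    ("Order number: 987654321", False),
    ("Reference ID: 555666777", False),
    ("Account: 12345678 (minimum length)", True),
    ("Account: 123456789012 (maximum length)", True),
    ("Acct 123-456-789 (with formatting)", True),
]


def run_samples():
    """Return the samples whose actual result differs from the expectation."""
    return [(text, expected) for text, expected in SAMPLES
            if rule_matches(text) != expected]


if __name__ == "__main__":
    for text, expected in run_samples():
        print(f"MISMATCH (expected {expected}): {text}")
```

Under these assumptions the only mismatch is the formatted "Acct" sample: the regex requires 8 to 12 consecutive digits and the keyword list does not include "acct", which is the kind of gap edge cases are meant to expose.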

Testing Methodology

  1. Unit Testing: Test individual patterns and keywords
  2. Integration Testing: Test complete rules with all components
  3. Performance Testing: Measure processing time with large content
  4. Accuracy Testing: Calculate precision and recall metrics

Measuring Accuracy

Calculate key metrics to assess rule performance:

  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
  • F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
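
The formulas above translate directly into code. This is a minimal Python sketch; the function name accuracy_metrics and the sample counts are hypothetical and only illustrate the arithmetic.

```python
def accuracy_metrics(true_positives: int, false_positives: int,
                     false_negatives: int) -> dict:
    """Compute precision, recall, and F1 from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when a rule never fires
    # or the test set contains no positive samples.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
metrics = accuracy_metrics(90, 10, 30)
print(metrics)
```

With these counts, precision is 90 / 100 = 0.9, recall is 90 / 120 = 0.75, and F1 is roughly 0.82.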

Tuning for Better Accuracy

Reducing False Positives

Problem: Rules match unintended content

Solutions:

  1. Add Context Keywords: Require supporting evidence

    <Pattern confidenceLevel="85">
      <IdMatch idRef="number-pattern"/>
      <Any minMatches="1">
        <Match idRef="financial-keywords"/>
        <Match idRef="banking-keywords"/>
      </Any>
    </Pattern>
  2. Use Exclusion Patterns: Filter out known false positives

    <Pattern confidenceLevel="80">
      <IdMatch idRef="ssn-pattern"/>
      <Match idRef="personal-context"/>
      <Not>
        <Match idRef="test-data-keywords"/>
      </Not>
    </Pattern>
  3. Adjust Proximity Settings: Reduce the maximum distance allowed between the ID match and its supporting evidence

    <Entity id="precise-detection" patternsProximity="150">
      <!-- Patterns must be closer together -->
    </Entity>
  4. Increase Confidence Thresholds: Report a match only when the evidence is strong

    <Pattern confidenceLevel="90"> <!-- Increased from 75 -->
      <IdMatch idRef="validated-pattern"/>
      <Match idRef="strong-context"/>
    </Pattern>

Reducing False Negatives

Problem: Rules miss legitimate sensitive content

Solutions:

  1. Add Alternative Patterns: Cover different formats

    <Entity id="comprehensive-detection">
      <Pattern confidenceLevel="90">
        <IdMatch idRef="formatted-pattern"/>
        <Match idRef="context-keywords"/>
      </Pattern>
      <Pattern confidenceLevel="75">
        <IdMatch idRef="unformatted-pattern"/>
        <Any minMatches="2">
          <Match idRef="context-keywords"/>
          <Match idRef="supporting-keywords"/>
        </Any>
      </Pattern>
    </Entity>
  2. Expand Keyword Lists: Include synonyms and variations

    <Keyword id="expanded-financial-terms">
      <Group matchStyle="word">
        <Term>account</Term>
        <Term>acct</Term>
        <Term>account number</Term>
        <Term>account #</Term>
        <Term>bank account</Term>
        <Term>checking</Term>
        <Term>savings</Term>
      </Group>
    </Keyword>
  3. Use Broader Patterns: Include more variations

    <Regex id="flexible-ssn-pattern">
      <Pattern>\b\d{3}[-.\s]?\d{2}[-.\s]?\d{4}\b</Pattern>
    </Regex>
  4. Lower Confidence Thresholds: Accept matches backed by weaker supporting evidence

    <Pattern confidenceLevel="65"> <!-- Decreased from 75 -->
      <IdMatch idRef="broad-pattern"/>
      <Match idRef="weak-context"/>
    </Pattern>
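
Broadening a pattern trades false negatives for false positives, so it helps to measure what the broader pattern gains and what it newly admits. The sketch below compares a strict hyphen-only SSN regex with the flexible-ssn-pattern shown above, using Python's re module as a stand-in for the product's regex engine (an assumption). The sample strings are made up, and the coverage helper is hypothetical.

```python
import re

STRICT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Same expression as the flexible-ssn-pattern resource.
FLEXIBLE = re.compile(r"\b\d{3}[-.\s]?\d{2}[-.\s]?\d{4}\b")

# Made-up format variations of the same placeholder number.
VARIANTS = ["123-45-6789", "123 45 6789", "123.45.6789", "123456789"]


def coverage(pattern, samples):
    """Return the samples in which the pattern finds at least one match."""
    return [sample for sample in samples if pattern.search(sample)]


print("strict:  ", coverage(STRICT, VARIANTS))
print("flexible:", coverage(FLEXIBLE, VARIANTS))

# The broader pattern also accepts any bare nine-digit number,
# so it relies more heavily on context keywords to stay accurate.
print("flexible on order number:", bool(FLEXIBLE.search("Order 987654321")))
```

The strict pattern covers only the hyphenated variant, while the flexible one covers all four and also matches an unrelated nine-digit order number. That is why the broader pattern is normally paired with context requirements.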

Advanced Tuning Techniques

Multi-Pattern Rules

Create rules with multiple patterns for different scenarios:

<Entity id="credit-card-comprehensive" patternsProximity="300" recommendedConfidence="80">
  <!-- High confidence: validated format with strong context -->
  <Pattern confidenceLevel="95">
    <IdMatch idRef="Func_credit_card_formatted"/>
    <Any minMatches="1">
      <Match idRef="credit-card-keywords"/>
      <Match idRef="payment-keywords"/>
    </Any>
  </Pattern>

  <!-- Medium confidence: pattern with multiple context clues -->
  <Pattern confidenceLevel="80">
    <IdMatch idRef="credit-card-regex"/>
    <Any minMatches="2">
      <Match idRef="credit-card-keywords"/>
      <Match idRef="payment-keywords"/>
      <Match idRef="financial-keywords"/>
    </Any>
  </Pattern>

  <!-- Lower confidence: pattern with strong context -->
  <Pattern confidenceLevel="70">
    <IdMatch idRef="number-pattern"/>
    <Any minMatches="3">
      <Match idRef="visa-keywords"/>
      <Match idRef="mastercard-keywords"/>
      <Match idRef="payment-context"/>
      <Match idRef="financial-context"/>
    </Any>
  </Pattern>
</Entity>

Contextual Tuning

Adjust rules based on content context:

<!-- Rule for structured forms -->
<Entity id="form-ssn-detection" patternsProximity="100">
  <Pattern confidenceLevel="90">
    <IdMatch idRef="Func_ssn_formatted"/>
    <Match idRef="form-keywords"/>
  </Pattern>
</Entity>

<!-- Rule for unstructured documents -->
<Entity id="document-ssn-detection" patternsProximity="400">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="Func_ssn_formatted"/>
    <Any minMatches="2">
      <Match idRef="personal-keywords"/>
      <Match idRef="government-keywords"/>
      <Match idRef="identity-keywords"/>
    </Any>
  </Pattern>
</Entity>

Language-Specific Tuning

Create localized versions for different languages:

<Keyword id="financial-terms-multilingual">
  <Group matchStyle="word" langcode="en">
    <Term>account</Term>
    <Term>payment</Term>
    <Term>balance</Term>
  </Group>
  <Group matchStyle="word" langcode="es">
    <Term>cuenta</Term>
    <Term>pago</Term>
    <Term>saldo</Term>
  </Group>
  <Group matchStyle="word" langcode="fr">
    <Term>compte</Term>
    <Term>paiement</Term>
    <Term>solde</Term>
  </Group>
</Keyword>

Performance Optimization

Pattern Optimization

  1. Use Word Boundaries: Wrap patterns in \b so they cannot match inside longer tokens

    <!-- Good: Uses word boundaries -->
    <Pattern>\b\d{3}-\d{2}-\d{4}\b</Pattern>

    <!-- Avoid: No anchoring -->
    <Pattern>\d{3}-\d{2}-\d{4}</Pattern>
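  2. Avoid Unnecessary Capturing: Use non-capturing groups when the captured text is not needed (this reduces capture overhead; it does not by itself prevent backtracking)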

    <!-- Good: Non-capturing group -->
    <Pattern>\b(?:\d{4}[-\s]?){3}\d{4}\b</Pattern>

    <!-- Avoid: Capturing group -->
    <Pattern>\b(\d{4}[-\s]?){3}\d{4}\b</Pattern>
  3. Optimize Quantifiers: Be specific about repetition

    <!-- Good: Specific repetition -->
    <Pattern>\b\d{3}-\d{2}-\d{4}\b</Pattern>

    <!-- Avoid: Unbounded quantifier -->
    <Pattern>\b\d+-\d+-\d+\b</Pattern>
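
The effect of word boundaries and bounded repetition is easiest to see on text that contains near misses. The following Python sketch runs three of the patterns above over one made-up string; Python's re module is used as a stand-in for the product's regex engine (an assumption), and the hits helper is hypothetical.

```python
import re

BOUNDED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # boundaries + exact counts
UNANCHORED = re.compile(r"\d{3}-\d{2}-\d{4}")     # exact counts, no boundaries
UNBOUNDED = re.compile(r"\b\d+-\d+-\d+\b")        # boundaries, open-ended counts


def hits(pattern, text):
    """Return every substring the pattern matches in the text."""
    return [match.group(0) for match in pattern.finditer(text)]


# Made-up text: a part number, a date, and one SSN-shaped value.
text = "Part 9123-45-67890, date 2024-01-15, SSN 123-45-6789"

print("bounded:   ", hits(BOUNDED, text))
print("unanchored:", hits(UNANCHORED, text))
print("unbounded: ", hits(UNBOUNDED, text))
```

Only the bounded pattern isolates the SSN-shaped value. The unanchored pattern also matches inside the part number, and the unbounded pattern matches the part number and the date as well.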

Rule Ordering

Order patterns by selectivity (most specific first):

<Entity id="optimized-rule">
  <!-- Most selective pattern first -->
  <Pattern confidenceLevel="95">
    <IdMatch idRef="highly-specific-pattern"/>
  </Pattern>

  <!-- Less selective patterns follow -->
  <Pattern confidenceLevel="80">
    <IdMatch idRef="moderately-specific-pattern"/>
    <Match idRef="context-keywords"/>
  </Pattern>

  <!-- Least selective pattern last -->
  <Pattern confidenceLevel="65">
    <IdMatch idRef="broad-pattern"/>
    <Any minMatches="3">
      <Match idRef="context1"/>
      <Match idRef="context2"/>
      <Match idRef="context3"/>
    </Any>
  </Pattern>
</Entity>

Resource Management

  1. Limit Keyword Lists: Keep lists under 1000 terms
  2. Share Resources: Reuse common patterns and keywords
  3. Optimize Proximity: Use smallest effective values
  4. Monitor Memory: Track resource usage during testing

Troubleshooting Common Issues

Issue: Rule Not Matching Expected Content

Diagnosis Steps:

  1. Verify pattern syntax with regex testing tools
  2. Check keyword spelling and case sensitivity
  3. Validate proximity settings are appropriate
  4. Ensure context elements are present in test content

Solutions:

  • Test patterns in isolation
  • Add debug logging to identify failure points
  • Review evaluation context for missing elements
  • Adjust proximity values incrementally
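
Testing components in isolation can be partly automated. The sketch below is a hypothetical diagnose helper, not a product feature: it checks the ID regex, the keywords, and the distance between them separately, and reports the first component that fails. It assumes proximity is measured in characters between the ID match and the nearest keyword, and it inspects only the first ID match in the sample.

```python
import re


def diagnose(text, id_regex, keywords, proximity):
    """Report which component of a simplified rule fails on a sample."""
    id_match = re.search(id_regex, text)
    if id_match is None:
        return "id pattern did not match"

    lowered = text.lower()
    keyword_spans = [
        match.span()
        for keyword in keywords
        for match in re.finditer(rf"\b{re.escape(keyword.lower())}\b", lowered)
    ]
    if not keyword_spans:
        return "no context keyword found"

    # Character gap between the ID match and each keyword occurrence
    # (0 when they touch or overlap).
    gaps = [
        max(kw_start - id_match.end(), id_match.start() - kw_end, 0)
        for kw_start, kw_end in keyword_spans
    ]
    nearest = min(gaps)
    if nearest > proximity:
        return f"nearest keyword is {nearest} characters away (limit {proximity})"
    return "rule should match"


ID_REGEX = r"\b\d{8,12}\b"
KEYWORDS = ["account", "balance", "payment", "transaction"]

for sample in [
    "Acct 123-456-789",
    "Phone number: 123456789",
    "Account" + "." * 400 + "123456789",
    "Account number: 123456789",
]:
    print(diagnose(sample, ID_REGEX, KEYWORDS, proximity=300))
```

Each sample fails at a different stage, which narrows the fix to the regex, the keyword list, or the proximity value.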

Issue: Too Many False Positives

Diagnosis Steps:

  1. Analyze false positive samples
  2. Identify common characteristics
  3. Review confidence thresholds
  4. Check for missing exclusion patterns

Solutions:

  • Add exclusion keywords for common false positives
  • Increase confidence requirements
  • Add more specific context requirements
  • Reduce proximity values

Issue: Performance Problems

Diagnosis Steps:

  1. Profile rule execution times
  2. Identify slow patterns
  3. Check for regex backtracking
  4. Monitor resource usage

Solutions:

  • Optimize regex patterns
  • Reduce pattern complexity
  • Order patterns by selectivity, most specific first
  • Consider rule splitting
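
When the product does not expose per-pattern timings, slow regexes can be identified offline. This is a rough Python sketch under stated assumptions: the time_patterns helper and the synthetic sample text are hypothetical, and timings from Python's re module only indicate relative cost, not the product's actual processing time.

```python
import re
import time


def time_patterns(patterns, text, repeats=5):
    """Return the best-of-N full-scan time in seconds for each named pattern."""
    timings = {}
    for name, source in patterns.items():
        compiled = re.compile(source)
        best = float("inf")
        for _ in range(repeats):
            started = time.perf_counter()
            for _match in compiled.finditer(text):
                pass  # consume every match so the whole text is scanned
            best = min(best, time.perf_counter() - started)
        timings[name] = best
    return timings


# Synthetic content; replace with representative samples from your environment.
sample = "Invoice 2024-01-15 ref 9123-45-67890 SSN 123-45-6789. " * 2000

patterns = {
    "bounded": r"\b\d{3}-\d{2}-\d{4}\b",
    "unanchored": r"\d{3}-\d{2}-\d{4}",
    "unbounded": r"\b\d+-\d+-\d+\b",
}

# Print the slowest pattern first.
ranked = sorted(time_patterns(patterns, sample).items(),
                key=lambda item: item[1], reverse=True)
for name, seconds in ranked:
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Taking the best of several runs reduces noise from other processes. Absolute numbers vary by machine, so compare patterns against each other on the same content.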

Deployment and Monitoring

Staged Deployment

  1. Development: Test with sample content
  2. Staging: Test with production-like data
  3. Limited Production: Deploy to subset of content
  4. Full Production: Deploy to all content streams

Monitoring Metrics

Track these key metrics in production:

  • Match Rate: Detections per unit of content
  • Confidence Distribution: Spread of confidence levels
  • Performance: Processing time per rule
  • False Positive Rate: User-reported incorrect matches
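
These metrics can be derived from exported detection records. The sketch below assumes a hypothetical record format, a list of dictionaries with a confidence value and a flag for user-reported false positives; the actual export format depends on your deployment. Per-rule processing time is omitted because it needs timing data from the engine.

```python
from collections import Counter


def summarize_detections(detections, items_scanned):
    """Summarize match rate, confidence spread, and reported false positives.

    detections: list of dicts with 'confidence' (int) and
    'reported_false_positive' (bool). This shape is an assumption.
    """
    total = len(detections)
    # Group confidence values into buckets of ten (70-79, 80-89, ...).
    buckets = Counter((d["confidence"] // 10) * 10 for d in detections)
    reported = sum(1 for d in detections if d["reported_false_positive"])
    return {
        "match_rate": total / items_scanned if items_scanned else 0.0,
        "confidence_distribution": dict(sorted(buckets.items())),
        "false_positive_rate": reported / total if total else 0.0,
    }


# Made-up records for illustration.
detections = [
    {"confidence": 95, "reported_false_positive": False},
    {"confidence": 85, "reported_false_positive": False},
    {"confidence": 85, "reported_false_positive": True},
    {"confidence": 70, "reported_false_positive": False},
]
print(summarize_detections(detections, items_scanned=200))
```

A rise in match rate without a matching rise in scanned content, or a shift toward the lower confidence buckets, is a reasonable trigger for reviewing a rule.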

Continuous Improvement

  1. Regular Review: Assess rule performance monthly
  2. Feedback Integration: Incorporate user feedback
  3. Pattern Updates: Keep patterns current with new data formats and threats
  4. Performance Monitoring: Watch for degradation over time

Best Practices Summary

Rule Design

  • Start simple and add complexity gradually
  • Use multiple patterns with different confidence levels
  • Include comprehensive context keywords
  • Test with diverse, representative content

Performance

  • Optimize regex patterns for efficiency
  • Order patterns by selectivity
  • Use appropriate proximity values
  • Monitor resource usage

Maintenance

  • Version control all rule changes
  • Document rule logic and intent
  • Review performance regularly
  • Keep keyword lists current

Testing

  • Create comprehensive test suites
  • Include positive, negative, and edge cases
  • Measure accuracy with precision/recall
  • Test performance with large content samples